Analysis of data for Dr. Stefano Allesina’s workshop: “A Skeptic’s Guide to Scientific Writing.”

For the most part, I am going to focus only on Articles.

Dr. Allesina visualized some of the data in a time-series format here. I am mostly going to look at the data without regard to time.

1 Distributions

1.1 Citations

1.2 Views

1.3 Authors

1.4 Figures

1.5 Words in the abstract

1.6 Words in the title

1.7 Words in the dictionary in the abstract

Filtered out “0” values.

1.8 Simple words in the abstract

Filtered out “0” values.

1.9 References

2 Citations

2.1 Authors

2.2 Countries

2.3 Equations

3 Correlation of views and citations

Melissa raised an interesting question: how closely related are the numbers of views and citations? Stefano mentioned that for some types of documents they may not be closely related at all.
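One way to put a number on this relationship is a rank correlation, sketched below on simulated data. The column name `num_views` is an assumption (only `num_citations` appears in the model calls in this report), and the toy data frame stands in for the real `dat`.

```r
# Toy stand-in for the real data; `num_views` is an assumed column name.
set.seed(1)
toy <- data.frame(num_views = rpois(200, lambda = 500))
toy$num_citations <- rpois(200, lambda = toy$num_views / 50)

# Spearman's rho uses ranks, so it is not dominated by a few
# very highly viewed or highly cited papers.
rho <- cor(toy$num_views, toy$num_citations, method = "spearman")

# Pearson correlation on the log(x + 1) scale, the same transform
# applied to citations in the models in this report.
r_log <- cor(log(toy$num_views + 1), log(toy$num_citations + 1))

print(c(spearman = rho, pearson_log = r_log))
```

Running the same two calls separately for each document type would show whether the relationship really is weaker for some types, as Stefano suggested.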

4 How does the number of words in the title affect the number of citations?

4.1 Visualize

4.2 Visualize by year

4.3 GLM


Call:
glm(formula = log(num_citations + 1) ~ num_words_title, data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9485  -0.8668   0.1172   0.9335   5.2022  

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      3.001186   0.050745  59.142  < 2e-16 ***
num_words_title -0.026355   0.003891  -6.774 1.35e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 1.834748)

    Null deviance: 13241  on 7172  degrees of freedom
Residual deviance: 13157  on 7171  degrees of freedom
AIC: 24713

Number of Fisher Scoring iterations: 2
[1] "p-value"
    (Intercept) num_words_title 
   0.000000e+00    1.353578e-11 
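
Because the response is `log(num_citations + 1)`, the slope is easier to read after exponentiating. The sketch below copies the estimate and standard error from the summary above and turns them into an approximate percentage change in `num_citations + 1` per additional title word. The Wald interval is a normal approximation, not something the original analysis reports.

```r
# Coefficient and standard error copied from the summary above.
beta <- -0.026355
se   <-  0.003891

# Multiplicative change in (num_citations + 1) per additional title word.
ratio <- exp(beta)

# Approximate 95% Wald interval on the same multiplicative scale.
ci <- exp(beta + c(-1, 1) * qnorm(0.975) * se)

# Express everything as a percentage change.
pct <- 100 * (c(estimate = ratio, lower = ci[1], upper = ci[2]) - 1)
print(round(pct, 2))
```

Under these assumptions, each extra word in the title is associated with `num_citations + 1` being about 2.6% lower in this model, before accounting for publication year.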

4.4 GLM year as factor


Call:
glm(formula = log(num_citations + 1) ~ as.factor(year) + num_words_title, 
    data = dat)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.1902  -0.5138  -0.0089   0.5175   4.8916  

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)          3.695139   0.109791  33.656  < 2e-16 ***
as.factor(year)2006  0.507269   0.130068   3.900 9.71e-05 ***
as.factor(year)2007  0.345912   0.122713   2.819 0.004833 ** 
as.factor(year)2008  0.243100   0.119650   2.032 0.042215 *  
as.factor(year)2009  0.212868   0.116566   1.826 0.067868 .  
as.factor(year)2010  0.044285   0.115208   0.384 0.700700    
as.factor(year)2011 -0.064313   0.115291  -0.558 0.576980    
as.factor(year)2012 -0.268353   0.113519  -2.364 0.018108 *  
as.factor(year)2013 -0.395434   0.113180  -3.494 0.000479 ***
as.factor(year)2014 -0.565502   0.112896  -5.009 5.60e-07 ***
as.factor(year)2015 -0.701157   0.112509  -6.232 4.87e-10 ***
as.factor(year)2016 -0.913665   0.112968  -8.088 7.08e-16 ***
as.factor(year)2017 -1.160244   0.112953 -10.272  < 2e-16 ***
as.factor(year)2018 -1.576806   0.113160 -13.934  < 2e-16 ***
as.factor(year)2019 -2.121371   0.112215 -18.904  < 2e-16 ***
as.factor(year)2020 -3.107982   0.111647 -27.838  < 2e-16 ***
as.factor(year)2021 -3.609822   0.144289 -25.018  < 2e-16 ***
num_words_title     -0.006114   0.002517  -2.429 0.015158 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.7517154)

    Null deviance: 13241.2  on 7172  degrees of freedom
Residual deviance:  5378.5  on 7155  degrees of freedom
AIC: 18329

Number of Fisher Scoring iterations: 2
[1] "p-value"
        (Intercept) as.factor(year)2006 as.factor(year)2007 as.factor(year)2008 
      1.174960e-230        9.705879e-05        4.832592e-03        4.221465e-02 
as.factor(year)2009 as.factor(year)2010 as.factor(year)2011 as.factor(year)2012 
       6.786827e-02        7.006998e-01        5.769799e-01        1.810777e-02 
as.factor(year)2013 as.factor(year)2014 as.factor(year)2015 as.factor(year)2016 
       4.789688e-04        5.600388e-07        4.865478e-10        7.077134e-16 
as.factor(year)2017 as.factor(year)2018 as.factor(year)2019 as.factor(year)2020 
       1.396407e-24        1.448490e-43        8.071434e-78       5.115671e-162 
as.factor(year)2021     num_words_title 
      1.692719e-132        1.515838e-02 
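
The two fits can be compared directly from their printed summaries. The sketch below uses only numbers reported above to compute the share of deviance each model accounts for and the title-length slope as a percentage change. It assumes both models were fitted to the same rows, which the matching null deviance and degrees of freedom suggest.

```r
# Deviances copied from the two summaries above.
null_dev  <- 13241.2  # null deviance (printed as 13241 in the first fit)
dev_title <- 13157    # residual deviance: title length only
dev_year  <- 5378.5   # residual deviance: year factor + title length

# Proportion of the null deviance accounted for by each model.
explained <- c(title_only = 1 - dev_title / null_dev,
               with_year  = 1 - dev_year / null_dev)
print(round(explained, 3))

# Title-length slope from each fit, as a percentage change in
# (num_citations + 1) per additional word.
slopes <- c(title_only = -0.026355, with_year = -0.006114)
print(round(100 * (exp(slopes) - 1), 2))
```

Title length alone accounts for under 1% of the deviance, while adding year raises this to roughly 59%. The title-length effect shrinks to well under 1% per word once year is included, presumably because much of the raw association reflects publication year.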

5 Titles with “:”

[1] "Number of titles with a colon"
[1] 1442
[1] "Proportion of titles with a colon"
[1] 0.2010316
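
A minimal sketch of how counts like these can be produced in base R. The `titles` vector is made up for illustration and stands in for the title column of the real data, whose name is not shown in this report.

```r
# Made-up titles standing in for the real title column.
titles <- c("Networks: a primer",
            "A short title",
            "Ecology: theory: practice",
            "No colon here")

# fixed = TRUE matches the literal ":" character, not a regular expression.
has_colon <- grepl(":", titles, fixed = TRUE)

n_colon    <- sum(has_colon)   # number of titles with at least one colon
prop_colon <- mean(has_colon)  # proportion of titles with a colon
print(c(n = n_colon, proportion = prop_colon))

# Consistency check on the reported figures: the models above have
# 7172 null degrees of freedom, i.e. 7173 observations.
print(1442 / (7172 + 1))
```

The last line reproduces the reported proportion, which suggests the colon counts were computed on the same 7173 rows used in the models.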

5.1 Unique characters in the title

# A tibble: 107 x 2
   value     n
   <chr> <int>
 1 " "   84428
 2 " "       1
 3 "_"       2
 4 "-"    3915
 5 "–"      13
 6 "—"      14
 7 ","     565
 8 ";"       9
 9 ":"    1445
10 "!"       2
# … with 97 more rows
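
A character tally like this one can be sketched in base R as follows. The original output is a tibble, so it was presumably produced with tidyverse functions, and the `titles` vector here is made up. The last two lines show one way to tell apart characters that print identically, such as the two `" "` rows above, which may be different kinds of whitespace.

```r
# Made-up titles standing in for the real title column.
titles <- c("Networks: a primer", "Ecology: theory: practice")

# Splitting on the empty string breaks each title into single characters.
chars <- unlist(strsplit(titles, split = ""))

# Tabulate and sort so the most frequent characters come first.
counts <- sort(table(chars), decreasing = TRUE)
print(head(counts))

# Unicode code points distinguish look-alike characters.
print(utf8ToInt(" "))       # ordinary space
print(utf8ToInt("\u00a0"))  # non-breaking space
```

Note that `":"` appears 1445 times in the table above while 1442 titles contain a colon, which is consistent with a few titles containing more than one.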

6 By year

6.1 Values

6.2 Variance